Machine learning applications have spread across fields ranging from natural language processing to the vast majority of business functions, one of which is real estate. While machine learning will not take over the tasks of real estate professionals, it can help people perform complicated tasks, especially those involving relatively large amounts of data.
This project focuses on quantitative data science and aims at two goals. One is to predict housing prices from multiple variables such as room size and building type. The other is to forecast the sales of a real estate company.
Source: Kaggle page https://www.kaggle.com/ruiqurm/lianjia
The columns include: URL, ID, Lng, Lat, CommunityID (Cid), TradeTime, DOM (days on market), Followers, Total price, Price, Square, number of Living rooms, Drawing rooms, Kitchens and Bathrooms, Building type, Construction time, renovation condition, building structure, Ladder ratio (the proportion between the number of residents on the same floor and the number of elevators/ladders, i.e., how many ladders a resident has on average), elevator, Property rights for five years (related to China's restricted house purchase policy), Subway, District, and Community average price.
All the data come from https://bj.lianjia.com/chengjiao.
'price' in this data appears to be what is known as the unit price, which should equal totalPrice/square.
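Once the data are loaded (see below), this claim can be sanity-checked with a short snippet like the following. A constant ratio other than 1 would simply mean the two columns use different units (e.g., totalPrice recorded in units of 10,000 CNY) — that scale factor is an assumption, not something stated in the data description.
# Sanity check: if 'price' is the unit price, this ratio should be (nearly) constant.
ratio = df['price'] / (df['totalPrice'] / df['square'])
print(ratio.describe())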
(buildingType == 1 ~ "Tower", buildingType == 2 ~ "Bungalow", buildingType == 3 ~ "Plate/Tower", buildingType == 4 ~ "Plate")
(renovationCondition == 1 ~ "Other", renovationCondition == 2 ~ "Rough", renovationCondition == 3 ~ "Simplicity", renovationCondition == 4 ~ "Hardcover")
(buildingStructure == 1 ~ "Unavailable", buildingStructure == 2 ~ "Mixed", buildingStructure == 3 ~ "Brick/Wood", buildingStructure == 4 ~ "Brick/Concrete", buildingStructure == 5 ~ "Steel", buildingStructure == 6 ~ "Steel/Concrete")
(elevator == 1 ~ "Has_Elevator", elevator != 1 ~ "No_elevator")
(subway == 1 ~ "Has_Subway", subway != 1 ~ "No_Subway")
(fiveYearsProperty == 1 ~ "Ownership < 5y", fiveYearsProperty != 1 ~ "Ownership > 5y")
(district == 1 ~ "DongCheng", district == 2 ~ "FengTai", district == 3 ~ "DaXing", district == 4 ~ "FaXing", district == 5 ~ "FangShang", district == 6 ~ "ChangPing", district == 7 ~ "ChaoYang", district == 8 ~ "HaiDian", district == 9 ~ "ShiJingShan", district == 10 ~ "XiCheng", district == 11 ~ "TongZhou", district == 12 ~ "ShunYi", district == 13 ~ "MenTouGou")
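For later use in this Python notebook, the mappings above can be transcribed as plain dicts (copied from the Kaggle description; handy for decoding the integer codes during EDA, e.g. df['district'].map(DISTRICT)).
# Reference mappings transcribed from the Kaggle description.
BUILDING_TYPE = {1: 'Tower', 2: 'Bungalow', 3: 'Plate/Tower', 4: 'Plate'}
RENOVATION_CONDITION = {1: 'Other', 2: 'Rough', 3: 'Simplicity', 4: 'Hardcover'}
BUILDING_STRUCTURE = {1: 'Unavailable', 2: 'Mixed', 3: 'Brick/Wood',
                      4: 'Brick/Concrete', 5: 'Steel', 6: 'Steel/Concrete'}
DISTRICT = {1: 'DongCheng', 2: 'FengTai', 3: 'DaXing', 4: 'FaXing',
            5: 'FangShang', 6: 'ChangPing', 7: 'ChaoYang', 8: 'HaiDian',
            9: 'ShiJingShan', 10: 'XiCheng', 11: 'TongZhou', 12: 'ShunYi',
            13: 'MenTouGou'}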
This notebook is the data wrangling part of a self-directed project on housing prices in Beijing.
For data import, cleaning, and visualization, we import pandas, numpy, scipy, random, and matplotlib.pyplot.
For machine learning, we import LinearRegression from sklearn.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from IPython.display import clear_output
import os
from library.sb_utils import save_file
Other libraries and functions will be imported as necessary.
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:.4f}'.format
Toward the goals stated above, this notebook wrangles the data and makes it ready for EDA, which will be the next step.
The csv file of the raw data has already been downloaded from the Kaggle website. The file is located in the folder named raw_data.
The data include a date column, 'tradeTime' (yyyy-mm-dd), which will be parsed as datetime type. As for the encoding, we go with 'iso-8859-1'.
Here we read the file with the specifications mentioned above into df. We also keep a copy of the original dataframe, named original_df, to make checking and comparison easier.
df = pd.read_csv('../raw_data/Housing price in Beijing.csv',encoding='iso-8859-1',parse_dates = ['tradeTime'])
original_df = df.copy()
df.sort_values('tradeTime',inplace=True)
The warning from read_csv says some columns have mixed data types; we will keep this in mind when cleaning.
Display general information with the info method to check the number of rows, the column names, null values, and dtypes.
df.info()
Missing values are found in several columns, including DOM, buildingType, and communityAverage. These will be fixed later.
df.describe().T
Check the first 5 rows to visually inspect the data itself.
df.head()
All the data seem fine except the floor column, which holds two pieces of information: a leading part and a trailing part. The leading part is unreadable (garbled characters) for some reason, but we can still use it as a category. The trailing part, which looks like the floor number, is fine.
Based on df.info(), some problems have been found. In this section, we deal with the errors one by one.
To make comparisons easier, first extract the year and month into separate columns.
df['year'] = df.tradeTime.dt.year
df['month'] = df.tradeTime.dt.month
df.sort_values('tradeTime',inplace=True)
The first error found is data sitting in the wrong columns. For example, the bathRoom column contains some unreasonable values.
df.bathRoom.unique()
Now we explore what is going wrong in the 'bathRoom' column. As we can see, it contains values such as 2003 and 2006, which look like years; such values should appear only in the constructionTime column.
To see the details, we compare correct rows (bathRoom == 6 or 7) against wrong rows (bathRoom == 1990).
df[(df['bathRoom']==1990)|(df['bathRoom']==6)|(df['bathRoom']==7)].head(4)
We also found that the rows with wrong data in bathRoom contain another piece of wrong data: '#NAME?' in the livingRoom column.
Now we check the rows with '#NAME?' in the livingRoom column.
df[df.livingRoom=='#NAME?'].head()
len(df[df.livingRoom=='#NAME?'])
Based on this result, there are 32 rows with livingRoom == '#NAME?'. In these rows, not only bathRoom but all the columns up to ladderRatio hold data that belongs three columns to the right. One possible solution follows from this observation: for rows whose 'bathRoom' value is a year or anything other than an integer below 100 (no house has 100 bathrooms), shift the misplaced data horizontally by three columns to the right, from livingRoom through ladderRatio. A sketch of this option follows.
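For illustration only, the shift could look like the snippet below; it is kept commented out because we ultimately drop these rows instead.
# Hypothetical repair (not applied): shift the misaligned values three
# columns to the right, from livingRoom through ladderRatio.
# bad = df['livingRoom'] == '#NAME?'
# cols = df.loc[:, 'livingRoom':'ladderRatio'].columns
# df.loc[bad, cols] = df.loc[bad, cols].shift(3, axis=1)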
Another problem is that the value sitting in buildingType actually seems to belong in ladderRatio, which is FOUR columns to the right, so a uniform shift would not fix everything.
Although there may be some way to save these 32 rows, it is difficult to determine the right destination column for each misplaced value. In addition, 32 rows are tiny compared to the 318,850 rows in total, and shifting would itself create missing values that are hard to fill just to salvage 32 rows. Our conclusion is to drop the 32 rows with livingRoom == '#NAME?'.
df = df[df.livingRoom!='#NAME?']
The original 'floor' column holds two features in one column. Separate them into two new columns.
df['floor'].unique()
def floorType(s):
    # The leading token of 'floor' is the (garbled) floor-type category.
    try:
        return s.split(' ')[0]
    except Exception:
        return np.nan

def floorHeight(s):
    # The trailing token of 'floor' is the numeric floor height.
    try:
        return int(s.split(' ')[1])
    except Exception:
        return np.nan
df['floorType'] = df['floor'].map(floorType)
df['floorHeight'] = df['floor'].map(floorHeight)
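For reference, the same split can be done in vectorized form (a sketch, kept commented out; note that pd.to_numeric would yield floats where the height is missing):
# parts = df['floor'].str.split(' ', n=1, expand=True)
# df['floorType'] = parts[0]
# df['floorHeight'] = pd.to_numeric(parts[1], errors='coerce')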
df.head()
Confirmed that floorType and floorHeight have been created, holding categorical and numerical values respectively.
df.drop('floor',axis=1,inplace=True)
df.bathRoom.unique()
Columns 11, 12, 14, 15, and 16 hold object-type or mixed data. Here we convert columns 11, 12, and 14 (livingRoom, drawingRoom, and bathRoom) to int; the remaining ones are handled separately below.
df.iloc[:,[11,12,14]] = df.iloc[:,[11,12,14]].applymap(int)
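An equivalent name-based conversion (assuming columns 11, 12, and 14 are indeed livingRoom, drawingRoom, and bathRoom) would be less fragile if the column order ever changes:
# df[['livingRoom', 'drawingRoom', 'bathRoom']] = (
#     df[['livingRoom', 'drawingRoom', 'bathRoom']].astype(int))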
# Confirm the result
df.bathRoom.unique()
# Check the current categorical values in floorType
df.floorType.unique()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.floorType = le.fit_transform(df.floorType)
# Confirm the new categorical values
df.floorType.unique()
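To keep the encoded values interpretable, record the label-to-code mapping from the fitted encoder.
# Map each original floorType label to its LabelEncoder code.
floor_type_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(floor_type_mapping)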
The datatype of 'constructionTime' should be a year or another time type.
df.constructionTime.unique()
len(df[df.constructionTime=='δ֪']),df[df.constructionTime=='δ֪']['tradeTime'].apply(lambda x: x.year).unique()
One unreadable string, 'δ֪' (garbled characters, presumably meaning 'unknown'), has been found; it occurs 19,283 times, and those rows have relatively recent trade years.
Our hypothesis is that 'δ֪' means a new house.
Now we check the rows that do not contain 'δ֪', comparing the construction year with the tradeTime year.
known = df[df.constructionTime != 'δ֪']
constructionYear = known.constructionTime.astype(int)
tradeYear = known['tradeTime'].dt.year
print((constructionYear > tradeYear).sum(),   # construction year after trade year
      (constructionYear == tradeYear).sum(),  # same year
      (constructionYear < tradeYear).sum())   # construction year before trade year
Based on this comparison, we conclude that our hypothesis is reasonable.
Now we replace 'δ֪' with the tradeTime year.
def _const_correction(cols):
    # cols[0] is constructionTime, cols[1] is the trade year.
    construction_year = cols[0]
    if construction_year == 'δ֪':
        return int(cols[1])  # replace the unknown marker with the trade year
    else:
        return int(cols[0])

df['constructionTime'] = df[['constructionTime','year']].apply(_const_correction, axis=1)
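A vectorized equivalent of the row-wise apply above (a sketch, not executed here) would be:
# df['constructionTime'] = np.where(df['constructionTime'] == 'δ֪',
#                                   df['year'],
#                                   df['constructionTime']).astype(int)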
df.info()
Major options for handling missing values are dropping the affected rows or columns, imputing with a constant or a statistic, or keeping the gaps for later analysis.
First, visualise and quantify the NaN values.
import missingno as msno
import matplotlib.pyplot as plt
print(df.isnull().sum()/len(df)*100)
# sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
# Visualize missingness
msno.matrix(df)
plt.show()
df.isna().sum().plot(kind="bar")
plt.show()
Based on this information, three columns contain null data, and among them the missing percentage in the DOM column is significantly large at 49.55%. In the following sections, we handle these missing data column by column.
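As a compact view, list only the columns that actually contain nulls:
# Percentage of nulls, restricted to columns with at least one null.
null_pct = df.isnull().mean() * 100
print(null_pct[null_pct > 0].sort_values(ascending=False))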
Replace the null data in the buildingType column with valid data.
df['buildingType'].isnull().sum()
2,021 missing values is quite a lot.
df['buildingType'].unique()
Instead of dropping them, create and assign a new category label, 5.
df['buildingType'] = df['buildingType'].replace({np.nan:5})
df['buildingType'].unique()
df.isnull().sum()/len(df)*100
Before imputing communityAverage, check how the community average seems to have been calculated.
# Make a list of columns that seem related to community average.
checklist = ['communityAverage','Cid','totalPrice','price','year','month','district','buildingType','buildingStructure']
# Make a new dataframe of checklist
comAveSet = df[checklist]
# Check the max of communityAverage
comAveSet.communityAverage.max()
# Filter the rows at the maximum communityAverage to see which values go into the average
comAveSetmax = comAveSet[comAveSet.communityAverage==comAveSet.communityAverage.max()]
pd.set_option('display.float_format', lambda x: '%.2f' % x)
comAveSetmax.describe().T
# Another search, this time on the second-largest communityAverage
comAveSetsecmax = comAveSet[comAveSet.communityAverage==146634.0]
pd.set_option('display.float_format', lambda x: '%.2f' % x)
comAveSetsecmax.describe().T
From these analyses, communityAverage seems to be the average price within a district or Cid. More importantly, the community average is a value averaged over multiple years.
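As a rough cross-check (a sketch only; exact agreement is not expected, since the community average appears to span years beyond this data set), we can compare communityAverage with the mean unit price per Cid computed from this data.
# Correlate the reported communityAverage with the per-Cid mean 'price'.
cid_mean = df.groupby('Cid')['price'].mean().rename('cidMeanPrice').reset_index()
check = df[['Cid', 'communityAverage']].drop_duplicates('Cid').merge(cid_mean, on='Cid')
print(check[['communityAverage', 'cidMeanPrice']].corr())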
First, check the number of unique values in each column.
len(df['Cid'].unique()),len(df['communityAverage'].unique()),len(df['district'].unique())
Next, check the summary for a single Cid (117800835289894).
df[df.Cid==117800835289894].describe().T
communityAverage has 4,073 unique values, while Cid has 4,035; the two counts are very close. In contrast, the number of districts is far too small compared to 4,073.
A single district contains many Cids. Also, although the map analysis has been removed because the data weighs hundreds of MB, the geographical relationship between Cid and district showed that district boundaries do not coincide with Cid boundaries, so several Cids are mixed along the borders of districts.
We conclude the following: in order to keep the feature variables independent and meaningful, drop the communityAverage column.
df.drop('communityAverage',inplace=True,axis=1)
# confirm the remaining null.
df.isnull().sum()/len(df)*100
DOM is defined as the days on market. If a house stays on the market longer than others, customers looking for houses start to suspect there are problems with it.
msno.matrix(df)
The vast majority of missing values occur in the earlier period of this data set.
df.describe().DOM
DOM seems an important variable, so we would like to impute the missing values somehow.
However, because of the large discrepancy among the median, mean, and maximum, further analysis is needed, which will happen in the next EDA step. For now, we keep the missing values in the DOM column.
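For reference, two candidate imputations to evaluate in the EDA notebook (sketches only, kept commented out like the constant fill below):
# df['DOM'] = df['DOM'].fillna(df['DOM'].median())  # global median, robust to skew
# df['DOM'] = df.groupby('year')['DOM'].transform(lambda s: s.fillna(s.median()))  # per-year median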
# df['DOM'] = df['DOM'].fillna(1.0)
# Final check
df.isnull().sum()/len(df)*100
Saving should be done after the communityAverage handling above, as that column contained null values.
# save the data to a new csv file
datapath = '../data'
save_file(df, 'data_step2_cleaned.csv', datapath)
The key processes and findings from this notebook are as follows.
The removed columns: 'floor' (split into floorType and floorHeight) and 'communityAverage'.
DOM column: although a large share of values is missing in the DOM column, DOM is considered an important key performance indicator in real estate, so we leave its imputation out of this data wrangling and explore it further in the 03 EDA notebook.
price and totalPrice: as discussed above, there is duplication in the price information; for example, 'totalPrice' seems equal to 'price' multiplied by 'square'. One of the columns will need to be removed, but we will discuss this in EDA.